Draft genome of Castanopsis chinensis, a dominant species safeguarding biodiversity in subtropical broadleaved evergreen forests

Objectives: Castanopsis is the third largest genus in the Fagaceae family and is essentially tropical or subtropical in origin. The species in this genus are mainly canopy-dominant trees and, as key components of evergreen broadleaved forests, play a crucial role in the maintenance of local biodiversity. Castanopsis chinensis, distributed from South China to Vietnam, is a representative species. It is currently heavily disturbed by human activity and climate change. Here, we present its assembled genome to facilitate its preliminary conservation and breeding at the genome level.

Data description: The C. chinensis genome was assembled and annotated using Nanopore and MGI whole-genome sequencing reads and RNA-seq reads from leaf tissue. The assembly was 888,699,661 bp in length, consisting of 133 contigs with a contig N50 of 23,395,510 bp. A completeness assessment of the assembly with Benchmarking Universal Single-Copy Orthologs (BUSCO) indicated a score of 98.3%. Repetitive elements comprised 471,006,885 bp, accounting for 55.9% of the assembled sequences. A total of 51,406 genes coding for 54,310 proteins were predicted. Multiple databases were used to functionally annotate the protein sequences.


Pan Chen 1, Ju-Yu Lian 2,3,4*, Bin Wu 1, Hong-Lin Cao 2,3,4, Zhi-Hong Li 1 and Zheng-Feng Wang 2,3,4*

Objective
Castanopsis is the third largest genus in the Fagaceae family [1]. The genus includes about 120-130 species [1-3]. Fossil evidence indicates that Castanopsis was widely distributed in both the Northern and Southern Hemispheres from the Eocene to the Pliocene [2,4], but it is currently distributed mainly in the subtropics and tropics of East and Southeast Asia [2,4]. Castanopsis species are mainly canopy-dominant trees and can grow up to 25-40 m in height [5]. They are therefore the main components of evergreen broadleaved forests, safeguarding local biodiversity [2,6]. Castanopsis species are good timber trees, and their seeds are edible [2,3,7]. They also contain many polyphenols and are used in traditional medicines [3]. Climate change is the main threat to Castanopsis species because of their restricted migration ability [3,7]. A study of 32 dominant Castanopsis species in East Asia that grow from 5°N to 38°N predicted that their present high-richness distribution range will be reduced by 94.5%, on average, by 2070 [3].
China is a central region of the Castanopsis distribution and includes approximately 60 species [3], half of which are endemic to China [1]. Castanopsis chinensis is distributed from South China to Vietnam. It is a pioneer dominant canopy tree in evergreen broadleaved forests and plays a key role in ecosystems [3,8]. As a fast-growing and soil erosion-controlling species, it is also widely used in reforestation [7]. Because the C. chinensis distribution area is heavily disturbed by human activity, most forests have been converted or degraded. We therefore present the C. chinensis genome here to better understand its evolution and adaptation and to enhance its conservation, management and utility in the future.

Data description
We collected leaf samples of C. chinensis from an individual planted in the South China Botanical Garden, Guangzhou, China. To perform genome assembly and annotation, three sequencing libraries were constructed from genomic DNA or RNA extracted from the leaf tissues. The first library was used for long-read whole-genome sequencing on a Nanopore PromethION sequencer, which generated about 139.5 GB of data (Data files 1-3) [9-11]. The second was used for short-read whole-genome sequencing on an MGI DNBSEQ-T7 sequencer, which generated about 149.6 GB of data (Data file 4) [12], and the third was used for RNA sequencing (RNA-seq) on an MGI DNBSEQ-T7 sequencer, which generated about 29.7 GB of data (Data file 5) [13]. All sequencing on the MGI platform was performed in 150 bp paired-end mode.
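As a rough illustration of what these data volumes mean relative to the assembly, the following minimal sketch divides each genomic data yield by the final assembly length to obtain an approximate sequencing depth. It assumes that the reported "GB" values can be treated as 10^9 bases each, which this note does not state explicitly; the function and variable names are hypothetical.

```python
# Approximate sequencing depth = sequenced bases / assembly length.
# Assumption: each reported "GB" is treated as 1e9 bases (not confirmed
# by the text; if the values are file sizes, the depths will differ).

ASSEMBLY_BP = 888_699_661  # final assembly length reported in this note

# Data volumes reported for the two genomic DNA libraries
YIELD_GB = {
    "Nanopore PromethION": 139.5,
    "MGI DNBSEQ-T7 (WGS)": 149.6,
}


def approx_depth(yield_gb: float, genome_bp: int) -> float:
    """Return the approximate fold coverage for a given data yield."""
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    return yield_gb * 1e9 / genome_bp


depths = {name: approx_depth(gb, ASSEMBLY_BP) for name, gb in YIELD_GB.items()}

for name, depth in depths.items():
    print(f"{name}: ~{depth:.0f}x")
```

Under that assumption, the sketch gives roughly 157x for the Nanopore data and 168x for the MGI short-read data. These depths are derived here for illustration only and are not values reported by the authors.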

Limitations
Although the genome size estimates from KmerGenie and GenomeScope were highly discrepant (1,143,475,699 and 744,772,109 bp, respectively), the final assembly size of 888,699,661 bp was comparable to previously reported genome sizes for Castanopsis species, including 878.6 Mb for C. tibetana [1] and 882.6 Mb for C. hystrix [41], and larger than the 785.5 Mb reported for C. mollissima [42]. Accurate genome size estimation with short reads has been reported to be challenging [43,44]. Long HiFi sequencing data may therefore be needed to obtain an accurate size estimate [44]. Owing to their long length and high accuracy, HiFi data can characterize genome size reliably with both small and large k-mers [43,45-48], which helps determine the true value when short-read estimates disagree [43].
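To make the idea behind k-mer-based size estimation concrete, the sketch below applies the basic relation "genome size ≈ total k-mer occurrences / depth of the main peak" to a toy k-mer depth histogram, and then expresses the two estimates reported above as ratios of the final assembly size. This is a simplified illustration only: it is not the algorithm implemented by KmerGenie or GenomeScope, which fit statistical models to the histogram, and the toy histogram, the error cutoff `min_depth`, and all function names are hypothetical.

```python
from typing import Mapping


def estimate_genome_size(histogram: Mapping[int, int], min_depth: int = 5) -> float:
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (a crude cutoff chosen for illustration).
    """
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers: taken as the single-copy peak.
    peak_depth = max(kept, key=kept.get)
    # Total k-mer occurrences contributed by the retained depths.
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical): low-depth error k-mers, a single-copy
# peak around depth 20, and a few two-copy k-mers at depth 40.
toy_histogram = {1: 5000, 2: 800, 18: 50, 19: 120, 20: 200, 21: 120, 22: 50, 40: 10}
toy_size = estimate_genome_size(toy_histogram)

# Values reported in this note (bp).
ASSEMBLY_BP = 888_699_661
ESTIMATES_BP = {"KmerGenie": 1_143_475_699, "GenomeScope": 744_772_109}

# Ratio of each estimate to the assembly size.
ratios = {tool: size / ASSEMBLY_BP for tool, size in ESTIMATES_BP.items()}

print(f"toy genome size estimate: {toy_size:.0f} k-mer positions")
for tool, ratio in ratios.items():
    print(f"{tool}: {ratio:.3f} x assembly size")
```

In this sketch, the choice of peak and of the error cutoff directly changes the result, which is one simple way to see why two estimators can disagree on the same short-read data.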
The assembled genome in this report is still fragmented. It is therefore not suitable for analysis of complete genome structure, which hinders the investigation of complex genomic regions relevant to conservation and breeding [49-51]. Further high-quality genome assemblies (preferably complete and gapless) using ultra-long reads, Hi-C, and other sequencing technologies are needed [52].
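Contiguity of a fragmented assembly is commonly summarized by the contig N50, such as the value of 23,395,510 bp reported for the 133 contigs of this assembly. The following minimal sketch shows the standard definition of N50 on a toy set of contig lengths. The toy lengths and the function name are hypothetical, and the sketch does not reproduce the tool the authors used to compute their statistic.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical contig lengths (total = 300, so half = 150).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
print(f"toy N50: {toy_n50}")
```

Because N50 depends only on the length distribution, a high value does not by itself show that an assembly is complete or gapless, which is consistent with the need for further data noted above.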